class: center, middle, inverse, title-slide

# Bayesian non-parametric methods for clustering
## Part I: Unsupervised clustering with Dirichlet process mixtures
### Boris Hejblum & Anaïs Rouanet
### April 7th, 2021

---
## Instructors

.pull-left[
.center[]
.center[**Boris Hejblum**]
]
.pull-right[
<img src="data:image/png;base64,#accessories/figures/AnaisRouanet.jpeg" width="244px" style="display: block; margin: auto;" />
.center[**Anaïs Rouanet**]
]

---
## Class schedule

- .dark-red[9h-10h: Lecture part I] 🤓
- .purple[10h-10h15: _break_] 🥳
- .blue[10h15-10h45: practicals 1] 🧑‍💻
- .dark-red[10h45-11h15: Lecture part II] 🤓
- .purple[11h15-11h30: _break_] 🥳
- .blue[11h30-12h30: practicals 2] 🧑‍💻

---
## Acknowledgments & resources

<br>

#### Inspirations

- François Caron's short courses
- [Julyan Arbel slides](https://www6.inrae.fr/mia-paris/content/download/4604/43001/version/1/file/ArbelRochebrune.pdf)

<br>

#### Useful resources

- [Peter Orbanz lecture notes](http://www.gatsby.ucl.ac.uk/~porbanz/papers/porbanz_BNP_draft.pdf)
- [Botond Szabo slides](https://people.eecs.berkeley.edu/~jordan/courses/294-fall09/lectures/nonparametric/slides.pdf)
- [Yee Whye Teh encyclopedia entry](https://link.springer.com/referenceworkentry/10.1007%2F978-0-387-30164-8_219)

---
class: inverse center middle

# Introduction

---
## Simple mixture model example

.pull-left[
1-dimensional normal mixture model with 2 components: `\(x_{i}\in\mathbb{R}\)` with `\(i=1,\ldots,n\)` observations

`\(x_i \overset{\scriptsize\text{iid}}{\sim} \pi_1 \mathcal{N}(\mu_1, \sigma_1) + (1 - \pi_1) \mathcal{N}(\mu_2, \sigma_2)\)`
]
.pull-right[
.center[
<img src="data:image/png;base64,#accessories/figures/ExNmixture_millions_wide.png" alt="drawing" width="100%"/>
]
[Lin *et al.* *WIREs Comp.
Stat.* 2020] ]

---
## Mixture model

`\(\boldsymbol{x}_{i}\in\mathbb{R}^{d}\)` with `\(i=1,\ldots,n\)` observations

Let's assume `\(\boldsymbol{x}_{i}\overset{\scriptsize\text{iid}}{\sim}F\)`, with `\(\displaystyle F(\boldsymbol{x})=\int_{\boldsymbol{\Theta}}f_{\boldsymbol{\theta}}(\boldsymbol{x})d\mathcal{A}(\boldsymbol{\theta})\)`

- `\(f_{\boldsymbol{\theta}}(\boldsymbol{x})\)` is the known parametric probability density function of a component (e.g. a multivariate Gaussian with `\(\boldsymbol{\theta}=\left(\boldsymbol{\mu}, \boldsymbol{V}\right)\)`)
- `\(\mathcal{A}\)` is the mixing distribution

--

.pull-left[**Finite mixture model:**
`$$\mathcal{A}=\sum_{g=1}^G\pi_g\delta_{\boldsymbol{\theta}_g}$$`
]
.pull-right[**Infinite (nonparametric) mixture model:**
`$$\mathcal{A}=\sum_{g=1}^{\infty}\pi_g\delta_{\boldsymbol{\theta}_g}$$`
]

<br>

`\(\theta_g\)`: parameters of cluster `\(g\)`

`\(\delta_x\)`: Dirac point mass at `\(x\)`

---
## You said "Non-parametric"?!

.pull-left[
- **Non-parametric models**
  - _Do have parameters_
  - Can be understood as having an infinite number of parameters
  - Number of parameters can grow with the dataset (_dependent_)

<br>

- **VS parametric models**
  - _Finite_ and _fixed number_ of parameters
  - Number of parameters independent of the dataset
]

--

.pull-right[
Model complexity of `\(\{M_\theta : \theta \in \Theta\}\)`

| | **Parametric** | **Nonparametric** |
|---|----------|---------------|
| **Dimension** | _Finite_ | _Infinite_ |
| **Pros** | _Interpretation_ & computationally faster | _Flexibility_ & less risk of misspecification |
| **Cons** | Easy to _misspecify_ | Computationally & analytically challenging |
]

---
## Example of nonparametric flexibility

<img src="data:image/png;base64,#BNPclustering_part1_files/figure-html/unnamed-chunk-2-1.png" width="100%" />

---
class: inverse center middle

# Dirichlet Process mixture model (DPMM)

---
### Nonparametric Bayesian models for unsupervised clustering

**Dirichlet Process (DP) mixture
model**

<br>

➡️ `\(DP\)` _prior_ on the mixing distribution: <br>
<br>
`\begin{align}
\mathcal{A}\,\big|\,\alpha, \mathcal{A}_0 &\sim\text{DP}(\alpha,\mathcal{A}_{0})& \\
\boldsymbol{\theta}_{i}\,\big|\,\mathcal{A} &\sim \mathcal{A} \quad & \text{for }i=1,\ldots,n \\
x_{i}\,\big|\,\boldsymbol{\theta}_{i} &\sim f_{\boldsymbol{\theta}_{i}}\quad & \text{for }i=1,\ldots,n
\end{align}`
<br>
<br>

- `\(\mathcal{A}_0\)`: _base distribution_ parameter governing the clusters' location & shape
- `\(\alpha\)`: _concentration_ parameter influencing the distribution of the number of clusters

--

<br>

**Posterior distribution:** `$$\textstyle \mathcal{A}\,|\,\theta_{1:n}\sim DP\left(\alpha + n,\; \frac{\alpha \mathcal{A}_0 + \sum_{i=1}^n\delta_{\theta_i}}{\alpha + n} \right)$$`

---
## Dirichlet process — definition

.bg-washed-blue.b--blue.ba.bw2.br3.shadow-5.ph4[
**A Dirichlet Process (DP) is a probability distribution over the space of probability measures** whose marginal distributions are Dirichlet distributed: <br>
If `\(\mathcal{A}\sim\text{DP}(\alpha, \mathcal{A}_0)\)`, then `\(\forall\)` partition `\(A_1,\dots,A_G\)` of a measurable space `\(\mathcal{T}\)`:
`$$\left(\mathcal{A}(A_1),\dots ,\mathcal{A}(A_G)\right)\sim\text{Dir}\left(\alpha \mathcal{A}_0(A_1),\dots ,\alpha \mathcal{A}_0(A_G)\right)$$`
]

--

<br>

`\(\text{DP}(\alpha, \mathcal{A}_0)\)`:

- `\(\mathcal{A}_0\)`: **base distribution** — interpreted as the _mean of the DP_ <br> `\(\displaystyle\mathbb{E}\left[\mathcal{A}(A)\right]=\mathcal{A}_0(A)\)` <br>
- `\(\alpha>0\)`: **concentration parameter** — interpreted as the _inverse variance of the DP_ <br> `\(\text{Var}\left[\mathcal{A}(A)\right]=\dfrac{\mathcal{A}_0(A)(1-\mathcal{A}_0(A))}{\alpha+1}\)`

---
## The Dirichlet distribution

`\(\boldsymbol{\pi}\sim\text{Dir}\left(a_1,\dots,a_G\right)\)` with `\(\boldsymbol{\pi}=\left(\pi_1,\dots,\pi_G\right)\)`

<br>

- **Probability distribution over the simplex:** `\(\pi_g > 0\;\forall g \text{ and } \sum_{g=1}^{G}\pi_g=1\)`

<div class="figure"
style="text-align: center"> <img src="data:image/png;base64,#accessories/figures/Simplex_ex3.png" alt="[from Sue Liu]" width="1200px" /> <p class="caption">[from Sue Liu]</p> </div>

- **Density:** `\(\dfrac{\Gamma(\alpha)}{\prod_{g=1}^G \Gamma(a_g)} \prod_{g=1}^G\pi_g^{a_g-1}\)` with `\(\alpha=\sum_{g=1}^G a_g\)` <br>
- `\(\mathbb{E}[\pi_g]=\frac{a_g}{\alpha}\)` and `\(\text{Var}[\pi_g]=\frac{a_g(\alpha-a_g)}{\alpha^2(1+\alpha)}\)`

---
## Dirichlet process — a few properties

- The `\(\alpha\)` parameter from `\(\text{DP}(\alpha,\mathcal{A}_{0})\)` _influences `\(G_n\)`_, the **number of non-empty clusters** given `\(n\)` observations _a priori_:
`\begin{align*}
\mathbb{E}\big[G_n | \alpha, n \big] & =\sum_{i=0}^{n-1}\frac{\alpha}{\alpha+i}\quad\simeq\,\alpha\ln\left(1+\frac{n}{\alpha}\right)\,\text{ for } n,\alpha \gg 0\\
\text{Var}\big[G_n | \alpha, n \big] & =\sum_{i=0}^{n-1}\frac{\alpha i}{(\alpha+i)^{2}}
\end{align*}`

<br>

- The Dirichlet process is conjugate (like the finite Dirichlet distribution) for the clustering likelihood ➡️ the **predictive distribution** of `\(\boldsymbol{\theta}_{n+1}\)` knowing `\(\boldsymbol{\theta}_{1:n}\)` is:
`$$\boldsymbol{\theta}_{n+1} | \boldsymbol{\theta}_{1:n}\sim\frac{\alpha}{\alpha+n}\mathcal{A}_{0}+\frac{1}{\alpha +n}\sum_{i=1}^n\delta_{\boldsymbol{\theta}_{i}}$$`
(_Polya_ or _Blackwell-MacQueen_ urn scheme)

---
## Chinese Restaurant Process (CRP)

<img src="data:image/png;base64,#accessories/figures/CRP_online.gif" width="700px" style="display: block; margin: auto;" />

---
## CRP representation for the `\(DP\)`

.pull-left[
![](data:image/png;base64,#accessories/figures/CRP.png)
]
.pull-right[
**Culinary metaphor for the `\(DP\)`**

`$$\boldsymbol{\theta}_{n+1} | \boldsymbol{\theta}^*_{1:G}\sim\frac{\alpha}{\alpha+n}\mathcal{A}_{0}+\frac{1}{\alpha +n}\sum_{g=1}^G n_g\delta_{\boldsymbol{\theta}^*_{g}}$$`

where `\(\boldsymbol{\theta}^*\)` are the unique values among `\(\boldsymbol{\theta}_{1:n}\)` and `\(n_g\)` the number of `\(\boldsymbol{\theta}_i\)` equal to `\(\boldsymbol{\theta}^*_g\)`

- _rich get richer_ (self-reinforcing)
- the probability distribution induced over the tables is **a random draw from a `\(DP(\mathcal{A}_0,
\alpha)\)`**
]

---
## Stick-breaking representation

A **draw from a `\(DP(\mathcal{A}_0, \alpha)\)`** is _almost surely_ **discrete** and can be written `\(\displaystyle \mathcal{A}=\sum_{g=1}^{\infty}\pi_{g}\delta_{\boldsymbol{\theta}^*_{g}}\)` with:

- `\(\displaystyle\pi_{1}=\nu_{1} \quad\)` and `\(\quad \pi_{g}=\nu_{g}\prod_{j=1}^{g-1}\left(1-\nu_{j}\right)\qquad \text{with} \quad \nu_{g} \overset{\scriptsize\text{i.i.d.}}{\sim} \text{Beta}(1,\alpha)\)`
- `\(\boldsymbol{\theta}^*_{g} \overset{\scriptsize\text{i.i.d.}}{\sim} \mathcal{A}_{0} \quad\)` and `\(\quad \boldsymbol{\theta}^*_{g} \perp \pi_{g}\)`

<div class="figure" style="text-align: center"> <img src="data:image/png;base64,#accessories/figures/Stick.gif" alt="Recursively break a stick of length 1" width="700px" /> <p class="caption">Recursively break a stick of length 1</p> </div>

---
## Dirichlet process mixture

.pull-left[
`\(\displaystyle F(\boldsymbol{x})=\int_{\boldsymbol{\Theta}}f_{\boldsymbol{\theta}}(\boldsymbol{x})d\mathcal{A}(\boldsymbol{\theta})\)`
]
.pull-right[
<img src="data:image/png;base64,#accessories/figures/Mix.gif" width="100%" style="display: block; margin: auto;" />
]

---
## Dirichlet process mixture

`\(F(\boldsymbol{x})=\int_{\boldsymbol{\Theta}}f_{\boldsymbol{\theta}}(\boldsymbol{x})d\mathcal{A}(\boldsymbol{\theta})\)`

<img src="data:image/png;base64,#accessories/figures/Mix_all4_poster.png" width="550px" style="display: block; margin: auto;" />

---
## Impact of the concentration parameter `\(\alpha\)`

.pull-left[
`$$\begin{align*} \mathcal{A}\,\big| \,\alpha, \mathcal{A}_0 &\sim DP(\alpha,\mathcal{A}_{0}) \\ \boldsymbol{\theta}_i\,\big| \,\mathcal{A} & \sim \mathcal{A} \end{align*}$$`

- `\(\mathcal{A}_0\)` ≃ location of the `\(\boldsymbol{\theta}^*\)`s
- `\(\alpha~~\)` ≃ inverse variance of the weights (i.e. the number of `\(\boldsymbol{\theta}^*\)`s with "non-zero" weights)

➡️ **hyperprior:** `\(\alpha\sim \text{Gamma}(a, b)\)` [[West _ISDS discussion paper_ Duke Univ.
1992]](http://www2.stat.duke.edu/~mw/MWextrapubs/West1992alphaDP.pdf)

➡️ **posterior:** `\(\alpha | G_n \,\sim\, \text{Gamma}\left(a + G_n - 1,\, b + 0.58 + \log(n)\right)\)` (Poisson approximation of the number of non-empty clusters `\(G_n\)`) .small[0.58 ≈ Euler's constant `\(\gamma\)`]

[[Miller & Harrison _JMLR_ 2014]](https://jmlr.org/papers/v15/miller14a.html) showed that with a fixed `\(\alpha\)`, a DPM is misspecified and overestimates `\(G_n\)` (due to its infinite cluster number assumption)
➡️ need to fit `\(\alpha\)` to the data with a hyperprior
]
.pull-right[
<img src="data:image/png;base64,#accessories/figures/ExDPM_alpha2.png" width="330px" style="display: block; margin: auto;" />
<img src="data:image/png;base64,#accessories/figures/ExDPM_alpha20.png" width="330px" style="display: block; margin: auto;" />
]

---
## DP mixture of Normal distributions

**Gaussian Dirichlet Process Mixture Model:**

`$$\begin{align*} \alpha\,\big|\,a,b &\sim \text{Gamma}(a, b), \quad a=b=0.001\\ \boldsymbol{\pi}\,\big|\,\alpha &\sim \text{GEM}(\alpha)\\ \boldsymbol{\mu}_g, \boldsymbol{V}_g &\sim \mathcal{A}_0 & \forall\, g\\ c_i\,\big|\,\boldsymbol{\pi}&\sim \text{Mult}(\boldsymbol{\pi}) & \text{for } i=1,\ldots,n\\ \boldsymbol{x}_{i}\,\big|\,c_i, \left\{\boldsymbol{\mu}_g, \boldsymbol{V}_g\right\} &\sim\mathcal{N}\left( \boldsymbol{\mu}_{c_i}, \boldsymbol{V}_{c_i}\right)\qquad&\text{for } i=1,\ldots,n \end{align*}$$`

<br>

`\(GEM\)`: Griffiths-Engen-McCloskey distribution [[Hirth _Bernoulli_ 1997]](https://projecteuclid.org/journals/bernoulli/volume-3/issue-2/A-Poisson-approximation-for-the-Dirichlet-law-the-Ewens-sampling/bj/1177526730.full)

---
class: inverse center middle

# Computation and inference

---
## Sampling from the posterior

- **MCMC methods** sample `\(\boldsymbol{\theta}^{(j)}_{1:n}\)` (with `\(j=1,\dots,N\)` MCMC iterations) asymptotically `\(_{(N\rightarrow\infty)}\)` distributed from the marginal posterior: `\(\displaystyle p\left(\boldsymbol{\theta}_{1:n} | \boldsymbol{x}_{1:n} \right)\)`
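Before turning to posterior sampling, the Polya-urn (CRP) predictive itself is easy to simulate forward, which is a useful sanity check on the a-priori formula for `\(\mathbb{E}[G_n \mid \alpha, n]\)`. A minimal pure-Python sketch (the function names `crp_partition` and `expected_clusters` are illustrative, not from any package mentioned here; cluster locations `\(\boldsymbol{\theta}^*_g\)` are omitted):

```python
import math
import random

def crp_partition(n, alpha, rng):
    """Sample cluster sizes from a Chinese Restaurant Process prior:
    customer i joins existing table g with probability n_g / (alpha + i),
    or opens a new table with probability alpha / (alpha + i)."""
    sizes = []
    for i in range(n):
        r = rng.uniform(0, alpha + i)
        acc = 0.0
        for g, n_g in enumerate(sizes):
            acc += n_g
            if r < acc:
                sizes[g] += 1
                break
        else:
            # new table; in the full model theta*_g would be drawn from A_0 here
            sizes.append(1)
    return sizes

def expected_clusters(n, alpha):
    """A priori E[G_n | alpha, n] = sum_{i=0}^{n-1} alpha / (alpha + i)."""
    return sum(alpha / (alpha + i) for i in range(n))

n, alpha = 200, 2.0
draws = [len(crp_partition(n, alpha, random.Random(seed))) for seed in range(100)]
print(sum(draws) / len(draws))          # Monte Carlo estimate of E[G_n]
print(expected_clusters(n, alpha))      # exact finite sum
print(alpha * math.log(1 + n / alpha))  # logarithmic approximation
```

The three printed numbers should be close, illustrating both the _rich get richer_ dynamic and the logarithmic growth of the number of clusters with `\(n\)`.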
#### Marginal approaches are easy in the conjugate case

#### Advanced algorithms with conditional approaches

- **Slice sampler** [[Kalli _et al._, _Statistics and Computing_, 2011]](https://link.springer.com/article/10.1007/s11222-009-9150-y)
- **Partially collapsed Gibbs sampler** [[Caron _et al._, _Annals of Applied Statistics_, 2014]](http://dx.doi.org/10.1214/14-AOAS717)

---
## MCMC algorithm 1

#### Polya urn representation — Gibbs sampler

.pull-left[
1. Initialize `\(\theta_{1:n}^{(1)}\)`
2. For `\(j=2,\dots, N\)`
  - For `\(i=1, \dots, n\)` sample `\(\theta_{i}^{(j)} \, |\, \theta_{1:i-1}^{(j)}, \theta_{i+1:n}^{(j-1)}, x_{1:n}\)`

<br>

[[Neal 2000]](https://www.tandfonline.com/doi/abs/10.1080/10618600.2000.10474879)
]
.pull-right[
<img src="data:image/png;base64,#accessories/figures/BMQgibbs_alpha2.gif" width="500px" style="display: block; margin: auto;" />
]

---
## MCMC algorithm 2

#### CRP representation

Uses the latent allocations `\(c_i\)`

.pull-left[
At each iteration `\(j\)`:
1. For `\(i=1,\dots,n\)` sample `\(c_i\,|\,c_{-i}, x_i, \{\boldsymbol{\theta}^*_g\}\)`
  If `\(c_i\)` takes a new value, sample `\(\boldsymbol{\theta}^*_{c_i}\)` from `\(\mathcal{A}_0\)`
2. For `\(g \in \{c_1, \dots, c_n\}\)` sample `\(\boldsymbol{\theta}^*_g \,|\, \{x_i : c_i=g\}\)`

<br>

[[Neal 2000]](https://www.tandfonline.com/doi/abs/10.1080/10618600.2000.10474879)
]
.pull-right[
<img src="data:image/png;base64,#accessories/figures/CRPgibbs_alpha2.gif" width="500px" style="display: block; margin: auto;" />
]

---
## NPflow: partially collapsed Gibbs sampler

<br>

.pull-left[
For Gaussian clusters at iteration `\(j\)`:
1. Sample `\(\alpha\)`, _given the previous partition_ `\(c_{1:n}\)`
2. Sample `\(\mathcal{A}\)` _given_ `\(\alpha\)`, `\(\{\boldsymbol{\mu}_{g}\}\)`, `\(\{\boldsymbol{V}_{g}\}\)`, _and_ `\(\mathcal{A}_0\)`
3. For `\(i=1,\dots,n\)` sample the cluster allocation `\(c_i\)`, _given_ `\(\mathcal{A}\)`, `\(\{\boldsymbol{\mu}_{g}\}\)`, _and_ `\(\{\boldsymbol{V}_{g}\}\)`
4.
Sample `\(\{\boldsymbol{\mu}_{g}\}\)` and `\(\{\boldsymbol{V}_{g}\}\)`, _given_ `\(G\)` _and the updated partition_ `\(c_{1:n}\)`, with `\(\boldsymbol{\theta^*}_g = (\boldsymbol{\mu}_g, \boldsymbol{V}_g)\)`

[[Hejblum _et al._ _Ann App Stat_ 2019]](http://dx.doi.org/10.1214/18-AOAS1209)
]
.pull-right[
<img src="data:image/png;base64,#accessories/figures/DPMMGibbsNEx_300.gif" width="500px" style="display: block; margin: auto;" />
]

---
## Clustering estimation

.pull-left[
### Partition point estimate

Bayesian loss function (decision theory)

- Binder loss function (optimizing the Rand index)
- `\(F\)`-measure
- other cost functions ...

<br>

### Clustering uncertainty

**Posterior similarity matrix:** marginal pairwise co-clustering posterior probabilities
]
.pull-right[
<img src="data:image/png;base64,#accessories/figures/PointEst.png" width="350px" style="display: block; margin: auto;" />
<img src="data:image/png;base64,#accessories/figures/HMoverlapST_noDend.jpg" width="300px" style="display: block; margin: auto;" />
]

---
## Label switching

Issue with the stick-breaking representation of the `\(DP\)`: _cluster labels are ordered by expected size `\(n_g\)`!_

<br>

#### Solutions:

- label switching moves
- partition representations and updates

<br>
<br>

_NB:_ in practice, it is hard to propose stable new clusters ➡️ rather initialize with a large number of clusters that will gradually downsize

---
## R implementations

- [`BNPmix`](https://CRAN.R-project.org/package=BNPmix)
- [`dirichletprocess`](https://CRAN.R-project.org/package=dirichletprocess)
- [`NPflow`](https://CRAN.R-project.org/package=NPflow)
- used in other packages such as [`PReMiuM`](https://CRAN.R-project.org/package=PReMiuM)